AD-A166  083 




Distributed Data Structures: A Case Study

Carla Schlatter Ellis
Computer Science Department
University of Rochester
Rochester, NY 14627

TR150
August 1985




Approved for public release;
Distribution Unlimited



Abstract

In spite of the amount of work recently devoted to distributed systems,
distributed applications are relatively rare. One hypothesis to explain this
scarcity of different examples is a lack of experience with algorithm design
techniques tailored to an environment in which out-of-date and incomplete
information is the rule. Since the design of data structures is an important
aspect of traditional algorithm design, we feel that it is important to
consider the problem of distributing data structures. In this paper, we
investigate these issues by developing a distributed version of an extendible
hash file, which is a dynamic indexing structure that could be useful in a
distributed database.

1. Introduction

There is currently a significant amount of work being done in the area of
distributed systems. Among the motivations usually cited for the use of a
distributed system are ease of expansion, increased reliability, actual
geographic distribution, the ability to incorporate heterogeneous resources,
and resource sharing among autonomous sites. In spite of this, distributed
applications are relatively rare. By this we mean problems that actually
exploit some aspect of distribution and have been solved by user-level
distributed programs. As an example, one can easily imagine problems requiring
the computational power of a supercomputer along with an attractive user
interface using the window package of a personal workstation that would
benefit from the ability to incorporate both kinds of machines into the
solution. There are a number of hypotheses to explain the scarcity of
examples, including inadequate performance in networks and lack of programming
language support. A more important problem may be a lack of experience with
algorithm designs that tolerate inaccurate and inconsistent data. It appears
to be a fundamental characteristic of distributed computations that no one
component can easily gather knowledge of the true instantaneous global state
of the system. Thus, out-of-date and incomplete information is inevitable. The
purpose of our research has been to investigate distributed programming
techniques that acknowledge this principle.

Since the design of the data structures is an important aspect of traditional
algorithm design, we feel that it is valuable to consider the problem of
distributing data structures. For our purposes, a distributed system is
modeled as a number of logical processors communicating solely through
port-based asynchronous message-passing in the style of [Rashid M]. There is
no memory shared among these logical processors. A logical processor may
encompass multiple processes that execute on the same physical processor and
may share data among themselves. The phrase "distributing a data structure"
means that there are a number of logical processors each encapsulating some
portion of a single coherent data structure and acting as a manager for that
piece. The data structure may either be divided into disjoint portions or some
parts may be replicated in several managers. Replication may serve to increase
availability of the data structure when processors can fail, or to improve
performance by allowing more concurrency through a bottleneck of the structure
or by placing copies of heavily used information at users' sites. Such
replication raises the issue of maintaining consistency to an appropriate
degree. Although a number of general purpose mutual consistency algorithms are
available [Gifford 79, Stonebraker 79, Thomas 79], often it should be possible
to exploit certain properties of the specific problem at hand to arrive at a
less synchronized method. In this paper, we investigate these issues by
developing a distributed version of a particular indexing structure.

2. A Distributed Version of Extendible Hashing

Hashing has long been recognized as a fast method for accessing records by key
in large, relatively static databases. However, when the amount of data is
likely to vary significantly, traditional hashing can suffer from performance
degradation and may eventually require rehashing all the records into a larger
space. Extendible hashing [Fagin 79] is one of a number of recently developed
hashing schemes [Larson 78, Litwin 80, Lomet 83, Litwin 78] that can grow and
shrink in response to insertion and deletion operations. A distributed system
can provide the growth in resources to accommodate such growth in the data
structure. Thus it makes sense to investigate how to partition an extendible
hash file among the sites in a distributed environment. In addition,
availability considerations demand that any data structure used as an index
for a distributed database be itself distributed and possibly replicated.
Finally, it appears to be relatively easy to distribute components of an
extendible hash file in such a way that operations involve as few sites as
possible.

The sequential algorithms for extendible hashing are described in [Fagin 79].
The basic ideas and terminology are summarized below. The data structure
consists of two parts: a set of buckets and the directory. The buckets reside
on secondary storage and contain keys and associated information. The order of
the data within buckets is not important for this discussion. The directory is
an array of pointers to buckets. A hash function is used that generates a very
long pseudokey when applied to a key. The number of bits of the pseudokey
actually used to index into the directory is called the depth of the directory
and changes as the file grows or shrinks. In our work, the least significant
bits are used in order to simplify manipulations of the directory. Suppose
that the directory's depth is currently three. This means that at the moment,
there are eight valid directory entries. The entry i, 0 ≤ i ≤ 7, points to the
bucket that holds all the records whose pseudokeys end in the three-bit binary
representation of i. Each bucket includes a localdepth (≤ depth) indicating
that the pseudokeys of the records it contains agree in only that number of
bits. Thus multiple directory entries will point to the same bucket if its
localdepth is less than the directory's depth. Figure 1 gives an example of an
extendible hash file for sequential access. To perform a find operation for a
key, k, one would apply the hash function to k to obtain the pseudokey
(imagine it is '...101'), determine the current depth of the directory (2 in
this example), and use the appropriate bits ('01') as an index. Following the
pointer in the directory entry, one would search the third bucket for k. As
insertions occur, a bucket may become full (indicated by the count field) and
split into two buckets. If the old localdepth equals depth, the directory
doubles in size and depth increases by one. Similarly, deletions may result in
two buckets merging and possibly reducing the depth of the directory. One way
of detecting the condition that allows halving the size of the directory is to
keep a count (named depthcount) of the number of buckets whose localdepth
equals depth. Figure 2 shows how a sequence of updating operations would
affect the structure given in Figure 1, where x < y ≤ z and z is at most the
maximum number of keys allowed in a bucket. This data structure is our point
of departure for developing a distributed solution. The obvious partitioning
calls for two types of logical processors, namely directory managers that are
responsible for replicas of the directory component, and bucket managers. Each
bucket manager is responsible for a different subset of the buckets.
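
The sequential behaviour summarized above can be sketched in Python. This is
only an illustration under stated assumptions, not the algorithm of [Fagin 79]
verbatim: integer keys stand in for their own pseudokeys, the bucket capacity
of 2 is arbitrary, merging and depthcount are left out, and the class and
method names are invented here.

```python
class Bucket:
    def __init__(self, localdepth, capacity):
        self.localdepth = localdepth   # number of low-order bits shared by all keys
        self.capacity = capacity       # assumed fixed number of keys per bucket
        self.keys = []


class ExtendibleHashFile:
    """Sequential extendible hash file indexed by least significant bits."""

    def __init__(self, capacity=2):
        self.depth = 0
        self.capacity = capacity
        self.directory = [Bucket(0, capacity)]

    def _pseudokey(self, key):
        # Assumption: an integer key is used directly as its pseudokey.
        return key

    def _index(self, key):
        # Use the `depth` least significant bits of the pseudokey.
        return self._pseudokey(key) & ((1 << self.depth) - 1)

    def find(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.keys:
                return
            if len(bucket.keys) < bucket.capacity:
                bucket.keys.append(key)
                return
            self._split(bucket)        # bucket is full: split and retry

    def _split(self, bucket):
        if bucket.localdepth == self.depth:
            # Doubling: entry i + 2**depth shares its low bits with entry i,
            # so the second half starts out as a copy of the first.
            self.directory = self.directory + self.directory
            self.depth += 1
        bit = 1 << bucket.localdepth   # the newly significant bit
        bucket.localdepth += 1
        new = Bucket(bucket.localdepth, self.capacity)
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            target = new if self._pseudokey(k) & bit else bucket
            target.keys.append(k)
        # Redirect the directory entries whose index has the new bit set.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = new
```

Because the least significant bits are used, doubling the directory is a plain
copy of the existing entries, which is the simplification the text refers to.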

The distributed solution is derived from a solution allowing concurrent access
to a shared centralized extendible hash file [Ellis 83]. That solution is
based on locking protocols and modifications in the data structure to allow
for concurrency. Additional modifications are introduced here to improve
locality and allow replication of the directory component. The fundamental
change from the sequential version is that the buckets are linked through a
next field to allow recovery from concurrent restructuring operations. This
provides an alternate path to the desired data that can be used by a searching
operation when the information is being moved in a split or merge operation.
The approach is similar to the use of link pointers in Lehman and Yao's B-link
tree solution [Lehman 81]. When a bucket splits, the next link of the original
bucket is reassigned to point to the newly created bucket. The new bucket gets
the original bucket's old next pointer. Merging does the reverse. The next
pointer is also used for recovery through deleted, but not yet deallocated,
buckets. Deleted buckets and discarded halves of the directory are actually
deallocated only after ensuring that they are no longer needed. In addition,
there must be a way for a bucket manager performing the search phase of a
transaction to tell if it has read the wrong bucket. We chose to include a
field (commonbits) containing the common bit pattern that characterizes the
pseudokeys that belong in the bucket. Alternatively, one could reapply the
hash function to any key stored in the bucket and use this for comparison with
the target pseudokey, as long as the possibility of an empty bucket is taken
care of. "Wrong bucket" includes the case where the bucket has been merged
into a preceding bucket. That bucket is marked as "deleted" (using the
commonbits field). A prev link has been added to each bucket that leads to the
bucket from which this bucket originally split off. This information, which is
local to the bucket manager, is used to simplify finding the partner bucket
for a possible merge. Each link now represents a pair consisting of a
long-lived identifier for a manager port and a bucket address that is
meaningful to that manager. A version field introduced into each bucket and
each directory entry is used in updating directory copies asynchronously. The
resulting data structure appears in Figure 3. Two copies of the directory are
shown in that figure. Note that this example represents a consistent state
with no update operations in progress.
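
The added bucket fields and the link bookkeeping on a split can be sketched as
follows. This is an illustrative sketch only: the field names follow the text,
but representing commonbits as a bit string, using integer keys as pseudokeys,
and the `split` signature are assumptions made here. The version rule (both
halves get the original version plus one) is the one stated later in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A link is a pair: (manager port identifier, bucket address at that manager).
Link = Tuple[str, int]


@dataclass
class Bucket:
    commonbits: str                      # low-order bits shared by every pseudokey here
    version: int = 0
    keys: List[int] = field(default_factory=list)
    next: Optional[Link] = None
    prev: Optional[Link] = None

    @property
    def localdepth(self) -> int:
        return len(self.commonbits)


def split(old: Bucket, old_link: Link, new_link: Link) -> Bucket:
    """Split `old` in place and return the new bucket stored at `new_link`."""
    bit = 1 << old.localdepth            # the bit that now distinguishes the halves
    new = Bucket(
        commonbits="1" + old.commonbits,
        version=old.version + 1,
        next=old.next,                   # new bucket inherits the old next pointer
        prev=old_link,                   # prev leads back to the bucket it split from
    )
    new.keys = [k for k in old.keys if k & bit]
    old.keys = [k for k in old.keys if not k & bit]
    old.commonbits = "0" + old.commonbits
    old.version += 1
    old.next = new_link                  # original bucket now points at the new one
    return new
```

After the split, a reader that arrives at the original bucket with a key that
moved can still reach it by following next.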

The main purpose behind the modifications is to make it possible to tolerate
inconsistencies and inaccuracies in the directory data. In order to gain some
intuition for these structural changes, consider the configuration shown in
Figure 4. There are two active update operations: an insertion of a record
with pseudokey '...00' that has just caused a split, and the deletion of the
only record left with pseudokey of the form '...11', causing a merge. The top
copy of the directory has not yet recorded the effect of the split and the
bottom copy does not yet reflect the merge. Suppose there is a find operation
for pseudokey '...10' directed at the topmost directory. The first bucket
retrieved is the wrong bucket, as indicated by the comparison of the pseudokey
and commonbits, and the search continues with the next bucket, which turns out
to be the desired one. Similarly, consider a search for pseudokey '...11'
directed at the bottom copy of the directory. The first bucket read is marked
as deleted and the next link leads to the appropriate bucket.
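
The recovery path just described can be sketched in a few lines. The bucket
table below is a hypothetical configuration in the spirit of Figure 4, not a
transcription of it. Encoding a deleted bucket as `commonbits = None`, the
dictionary layout, and the function names are assumptions of this sketch, and
locking and remote managers are ignored.

```python
DELETED = None  # assumed marker stored in commonbits of a merged-away bucket


def belongs(pseudokey, commonbits):
    """True if the low-order bits of pseudokey match the bucket's commonbits."""
    if commonbits is DELETED:
        return False
    if commonbits == "":
        return True
    mask = (1 << len(commonbits)) - 1
    return (pseudokey & mask) == int(commonbits, 2)


def find(buckets, start, pseudokey):
    """Search from a possibly stale directory pointer.

    Returns (found, hops), where hops counts the next links followed.
    """
    addr, hops = start, 0
    while not belongs(pseudokey, buckets[addr]["commonbits"]):
        addr = buckets[addr]["next"]
        if addr is None:
            raise LookupError("next chain ended without reaching the right bucket")
        hops += 1
    return pseudokey in buckets[addr]["keys"], hops


# Hypothetical state: bucket 0 has just split into buckets 0 and 2, and
# bucket 3 has been merged into bucket 1 and marked deleted.
buckets = {
    0: {"commonbits": "00", "keys": [4], "next": 2},
    2: {"commonbits": "10", "keys": [2, 6], "next": 1},
    1: {"commonbits": "1", "keys": [3, 5], "next": None},
    3: {"commonbits": DELETED, "keys": [], "next": 1},
}
```

A stale directory entry that still points at bucket 0 for pseudokey '...10',
or at the deleted bucket 3 for '...11', costs one extra hop rather than a
failed search.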

The actions taken by the managers in response to messages received are
discussed below. Figure 5 shows the message types that flow between the
various managers. The information contained in these messages is outlined in
Figure 6. A condensed version of the procedure for the directory manager,
written in a C-like syntax [Kernighan 78], is given in Figure 7. The directory
manager is presented here as a server capable of handling multiple user
requests. The bucket manager is written as a front end process that serves as
the initial contact for its set of buckets and a set of associated processes
that reside at the same site and share secondary memory. The pseudo-code for
these processes is given in Figure 8.

Figure 3  Distributed Extendible Hash File

Figure 4  Out of Date Distributed Hash File

Figure 5  Managers and Message Flow

Figure 7  Pseudocode for Directory Managers


Notation:
    C-like statements;
    English-like pseudocode statements;
    /* comments */

while (true) {
    messageid = GetMessage (&msg);
        /* Either receives a message or takes a message off
           the list of delayed but now ready directory updates. */
    switch (messageid) {

    case request:  /* from user */
        readcount = readcount + 1;
            /* number of transactions in progress */
        Calculate pseudokey and locate current
            incarnation of the bucket manager responsible
            for desired bucket;
        Generate transaction # and save state related
            to this request;
        Construct and send a "find", "insert" or
            "delete" message;

    case bucketdone:  /* from bucket manager -
                         no directory update needed */
        RestoreState (msg.transaction #);
            /* Recall context for this request */
        if (!msg.success && operation == delete) {
            Try again: locate bucket manager again
                and reissue "find", "insert" or "delete"
                message;
        }
        else {
            readcount = readcount - 1;
            CleanState (msg.transaction #);
                /* forget about this request */
        }

    case update:  /* directory update at directory manager
                     that initially handled request */
        RestoreState (msg.transaction #);
        Send a copyupdate message to all other directory
            managers and increment copycount for each
            outstanding directory update;
        if (VersionsDoNotMatch(msg))
            /* compares version numbers in message with
               version numbers in corresponding directory entries */
            DelayUpdate(msg);
        else {
            if (operation == insert) {
                Apply appropriate updates
                    to local copy of directory;
                if (!msg.success) {
                    Try again;
                }
                else {
                    readcount = readcount - 1;
                    CleanState (msg.transaction #);
                }
            }
            else {  /* op = delete */
                Record location of deleted bucket for the
                    eventual garbage collection phase;
                Apply local directory updates;
            }
            /* If finishing this directory update enables previously
               delayed ones, make them accessible to GetMessage */
        }

    case copyupdate:  /* from other directory managers */
        if (VersionsDoNotMatch(msg))
            DelayUpdate(msg);
        else if (msg.op == insert) {
            Apply local directory updates;
            SendAck(msg.ackport);  /* respond to
                directory manager who initiated this update */
        }
        else {  /* op = delete */
            Apply local directory updates;
            RememberAck(msg.ackport);
                /* save up acks until the equivalent of
                   exclusive-locking occurs */
        }

    case ack:  copycount = copycount - 1;
    }

    ReleaseSaved();
        /* send acks saved by deletion copyupdates */
    if (!readcount && !copycount) GarbageCollect();
        /* get rid of buckets deleted through this directory */
}


Figure 8  Pseudocode for Bucket Managers


Bucket Manager Front End Process:

while (true) {
    messageid = receivemessage (&msg);
    if (messageid == splitbucket) {  /* from another bucket
                                        manager with no available space */
        Allocate available page;
        putbucket (newpage, msg.half2);
        Send "SplitReply" message containing link to new bucket;
    }
    else {
        Create a bucket slave process and forward msg to it;
    }
}


Bucket Slave Process:

messageid = receivemessage (&msg);
if (messageid == wrongbucket) sw = msg.op;
else sw = messageid;
switch (sw) {

case find:  oldpage = msg.page;
    ReadLock (oldpage);
    if (messageid == wrongbucket)
        Send "Ack" to bucket manager holding
            previous bucket;  /* allows it to unlock */
    else Send successful "Bucketdone" message;
        /* tells directory manager that no update is needed */
    getbucket (oldpage, current);
    onmachine = true;
    /* Follow next links until current is the right bucket: */
    while (current is wrong bucket && onmachine) {
        newpage = current -> next;
        machine = current -> nextmgr;
        if (machine != me) {  /* next bucket is remote */
            Send "Wrongbucket" message to next bucket
                manager;
            onmachine = false;
        }
        else {  /* next bucket is local */
            ReadLock (newpage);
            getbucket (newpage, current);
            UnReadLock (oldpage);
            oldpage = newpage;
        }
    }
    if (onmachine) {
        if (search (current, msg.key))  /* is key there? */
            found (msg.key);
        else notfound (msg.key);
    }
    else receivemessage (&msg);  /* Wrongbucket reply */
    UnReadLock (oldpage);
    break;

case insert:  onmachine = true;
    SelectiveLock (oldpage);
    if (messageid == wrongbucket)
        Send "Ack" to previous bucket manager;
    getbucket (oldpage, current);
    Follow next links until current is the right bucket
        (as in find case except use Selective locks instead of
        Read locks);
    if (!onmachine) {
        receivemessage (&msg);  /* Wrongbucket reply */
        UnSelectiveLock (oldpage);
    }
    else {
        if (search (current, msg.key)) {  /* is key already there? */
            success = true;
            Send "Bucketdone" message;
            UnSelectiveLock (oldpage);
        }
        else if (current -> count != numentries) {
            /* current bucket not full */
            success = true;
            Send "Bucketdone" message;
            add (current, msg.key);
                /* inserts key into current buffer */
            putbucket (oldpage, current);
            UnSelectiveLock (oldpage);
        }
        else {  /* current is full - directory will be affected */
            success = split (current, half1, half2, msg.key);
                /* distributes the contents of the current bucket into
                   2 buffers pointed to by half1 and half2;
                   if room available, inserts key into appropriate half
                   and returns true; otherwise returns false */
            if (AvailablePages()) {
                newpage = allocbucket ();
                machine = myid;
                putbucket (newpage, half2);
            }
            else {  /* no available pages locally */
                Send "Splitbucket" message
                    containing contents of new bucket
                    to a manager with space;
                receivemessage (&msg);  /* split bucket reply */
                machine = msg.bucketmgr;
            }
            half1 -> next = newpage;
            half1 -> nextmgr = machine;
            putbucket (oldpage, half1);
            UnSelectiveLock (oldpage);
            Send "Update" message to originating
                directory manager telling it to update directory;
        }
    }
    break;


case delete:
    Find the right bucket as in the beginning of insert
        except place Exclusive locks;
    if (!onmachine) {
        receivemessage (&msg);  /* Wrongbucket ack */
        UnExclusiveLock (oldpage);
    }
    else {
        if (current bucket will not be left "too empty"
                as a result of deleting msg.key) {
            Send successful "Bucketdone" message;
            if (remove (msg.key, current))
                putbucket (oldpage, current);
            UnExclusiveLock (oldpage);
        }
        else {  /* Merging partner buckets is called for */
            if (msg.key is in first bucket of the pair) {
                newpage = current -> next;
                machine = current -> nextmgr;
                if (machine == me) {
                    Merge on site;
                }
                else {  /* partner is remote */
                    Send "Mergedown" message to
                        partner's bucket manager;
                    receivemessage (&msg);
                        /* Mergedown Reply expected */
                    if (msg.success) {  /* OK to merge
                            (i.e. localdepths match);
                            contents of partner in msg */
                        Construct merged bucket
                            in current buffer;
                        putbucket (oldpage, current);
                        Send successful "Update" message;
                    }
                    else {  /* simply remove record */
                        Send successful
                            "Bucketdone" message;
                        if (remove (z, current))
                            putbucket (oldpage, current);
                    }
                    UnExclusiveLock (oldpage);
                }
            }
            else {  /* msg.key in second of pair */
                newpage = current -> prev;
                machine = current -> prevmgr;
                UnExclusiveLock (oldpage);
                if (machine == me) {
                    Merge on site;
                }
                else {  /* partner is remote */
                    Send "Mergeup" message to
                        partner's bucket manager;
                    receivemessage (&msg);
                        /* MergeUp Reply expected */
                    if (!msg.success) {
                        /* not mergeable - simply remove record */
                        Send successful
                            "Bucketdone" message;
                        if (remove (z, current))
                            putbucket (oldpage, current);
                    }
                    else {  /* apparently mergeable from
                               partner's point of view - check more locally */
                        ExclusiveLock (oldpage);
                        getbucket (oldpage, current);
                        if (key to be deleted no longer
                                belongs in current bucket) {
                            UnExclusiveLock (oldpage);
                            Send "Goahead" message to partner
                                with success field set to false;
                                /* cancels merge */
                            Send "Bucketdone" message
                                with success = false;
                                /* tell directory manager to retry */
                        }
                        else if (current -> localdepth
                                does not match localdepth in msg ||
                                current no longer "too empty") {
                            Send successful "Bucketdone"
                                message;
                            if (remove (z, current))
                                putbucket (oldpage, current);
                            UnExclusiveLock (oldpage);
                            Send "Goahead" message
                                with success = false;
                                /* cancel merge */
                        }
                        else {
                            Send successful "Goahead"
                                message to partner;
                                /* tell partner's manager to merge */
                            current -> next = current -> prev;
                            current -> nextmgr =
                                current -> prevmgr;
                            current -> commonbits = deleted;
                            putbucket (oldpage, current);
                            Send successful "Update"
                                message;
                        }
                    }
                }
            }
        }
    }
    break;

case mergedown:  newpage = msg.partner;
    ExclusiveLock (newpage);
    getbucket (newpage, brother);
    success = brother -> localdepth == msg.localdepth;
    Send "MergeDown Reply" to partner;
    if (success) {
        brother -> commonbits = deleted;
        brother -> next = brother -> prev;
        brother -> nextmgr = brother -> prevmgr;
        putbucket (newpage, brother);
    }
    UnExclusiveLock (newpage);
    break;

case mergeup:  newpage = msg.partner;
    ExclusiveLock (newpage);
    getbucket (newpage, brother);
    success = (brother -> next == msg.target) &&
              (brother -> nextmgr == msg.managerid);
    Send "MergeUp Reply";
    if (success) {
        receivemessage (&msg);  /* "GoAhead" expected */
        if (msg.success) {  /* merge */
            Construct merged bucket in brother;
            putbucket (newpage, brother);
        }
    }
    UnExclusiveLock (newpage);
    break;

case garbagecollect:
    for each page in msg.list {
        ExclusiveLock (page);
        deallocate (page);
        UnExclusiveLock (page);
    }
}

In the centralized solution, the directory component was locked during the
search for the target bucket to prevent interference between searches and
deletions. A deleting process placed an incompatible lock. If the deleter did
not exclude the reader and was in the process of halving the directory, the
reader might have attempted to access an invalid directory entry based on the
old value of depth. A similar interference could occur between readers and
deleters with regard to recently deallocated buckets. The locking of the
directory in the centralized solution translates into the manager's explicit
scheduling of requests for its attention in the distributed version.


A user wishing to perform an operation on the distributed hash file may
contact any directory manager with a request message. Upon receiving the
request, the manager saves some state about the desired operation, does the
directory lookup, and forwards the request to the bucket manager indicated.
After forwarding the request, the directory manager can service another
message. While a request is outstanding, the manager delays deallocation of
deleted components that the request may be depending upon.

The forwarded request is eventually received by the bucket manager front end.
A new slave process is created for each request requiring service from the
bucket manager (with the exception of an off-site split, which is handled by
the remote front end). The slave processes associated with a bucket manager
can manipulate the data in buckets belonging to this manager after locking the
bucket and transferring the information into private buffers. The buckets are
assumed to occupy physical pages on disk which are read and written as single
operations. The locking protocol uses various types of locks placed on
individual buckets. The compatibility of lock types is given by the following
table.


                   read-lock    selective-lock    exclusive-lock

read-lock             yes            yes               no

selective-lock        yes            no                no

exclusive-lock        no             no                no

If the request message calls for a find operation, a read-lock is placed on
the target bucket. For an insert operation, the slave process places a
selective-lock, and for a delete, an exclusive-lock.
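
The compatibility table and the choice of lock per operation translate
directly into a lookup. The sketch below is only a restatement of the table in
Python. The short mode names and the `can_grant` helper are invented here, and
a real lock manager would also need queuing and release.

```python
# Lock compatibility, keyed by (requested mode, mode already held).
COMPATIBLE = {
    ("read", "read"): True,
    ("read", "selective"): True,
    ("read", "exclusive"): False,
    ("selective", "read"): True,
    ("selective", "selective"): False,
    ("selective", "exclusive"): False,
    ("exclusive", "read"): False,
    ("exclusive", "selective"): False,
    ("exclusive", "exclusive"): False,
}

# Lock mode placed on the target bucket by each operation.
LOCK_FOR_OPERATION = {
    "find": "read",
    "insert": "selective",
    "delete": "exclusive",
}


def can_grant(requested, held):
    """True if `requested` is compatible with every mode in `held`."""
    return all(COMPATIBLE[(requested, h)] for h in held)
```

In particular, finds may proceed alongside one inserter on the same bucket,
while a deleter excludes everyone.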

Upon reading the data, the process may discover that it has the wrong bucket.
This means that a split or merge has occurred that was not yet reflected in
the copy of the directory that was read. In other words, now the localdepth
low order bits of the target pseudokey do not match the commonbits of this
bucket. By following the next pointer, the right bucket will eventually be
found. The next bucket is always locked prior to releasing the lock on the
current bucket. This flow of locks, known as lock-coupling, prevents processes
from leapfrogging each other. If the next bucket belongs to a different bucket
manager, a wrongbucket message is sent and acknowledged before the lock is
released. Once the right bucket is found, the desired operation is performed
and finally a response is sent to the directory manager that initially handled
the request. Lock incompatibilities prevent interference among updates. An
insert or delete operation may result in a splitting or merging of buckets.
Off-site splitting may be necessary if there is a shortage of available
buckets locally. Off-site merging occurs when the partner bucket belongs to a
different manager. Protocols are available to handle these situations
(splitbucket, mergedown, and mergeup messages and associated replies). If a
merge operation appears to be appropriate, the partner bucket can be
determined using local information (i.e. either next or prev links). In the
centralized algorithms it was acceptable to locate a partner bucket using the
directory. In the distributed case, this would have involved additional
message traffic for a bucket manager to send an inquiry message to a directory
manager and wait for a reply. In order to avoid deadlock, the partners for a
merge must be locked according to the ordering imposed by next links. If it is
necessary to lock the bucket pointed to by prev, the lock on the target bucket
must first be released and a number of conditions must be checked after
gaining the locks. This results in the differences between the mergeup and the
mergedown protocols.
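
The lock-coupling discipline for the local case can be sketched as below. This
is a simplified illustration, not the bucket manager's code: plain
`threading.Lock` objects stand in for the read, selective, and exclusive
locks, every bucket is assumed to be on the same machine (the remote case uses
the wrongbucket message and its acknowledgement instead), and the function and
parameter names are invented here.

```python
import threading


def follow_links(buckets, locks, start, is_right, log):
    """Walk next links with lock-coupling and return the address reached.

    The next bucket is locked before the current one is released, so no
    other process can overtake this one along the chain. On return the
    caller still holds the lock on the returned bucket.
    """
    current = start
    locks[current].acquire()
    log.append(("lock", current))
    while not is_right(buckets[current]):
        nxt = buckets[current]["next"]
        locks[nxt].acquire()             # lock the next bucket first ...
        log.append(("lock", nxt))
        locks[current].release()         # ... then release the current one
        log.append(("unlock", current))
        current = nxt
    return current


# Hypothetical three-bucket chain; the third bucket is the right one.
buckets = {
    0: {"commonbits": "00", "next": 1},
    1: {"commonbits": "10", "next": 2},
    2: {"commonbits": "1", "next": None},
}
locks = {addr: threading.Lock() for addr in buckets}
log = []
reached = follow_links(buckets, locks, 0, lambda b: b["commonbits"] == "1", log)
```

The log shows that at every step at least one lock on the chain is held, which
is what prevents a concurrent restructuring from slipping in between two hops.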

Two possible responses may come back to the directory manager from a bucket
manager, either bucketdone or update. Bucketdone will generally signify that
no directory modifications are needed and the directory manager may now forget
about this request. An update message calls for scheduling an update on the
local copy according to version number and notifying all other directory
managers by broadcasting a copyupdate message. For each outstanding
unacknowledged remote directory modification, a counter is incremented that
serves to prevent garbage collection. A bucket may not be deallocated until
all directories send an acknowledge message. Upon receiving a copyupdate
message, a directory manager schedules the update on its local copy and when
the changes have been applied (and in the case of delete operations, when no
outstanding requests remain at this manager), acknowledgements are sent.
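
The two counters that gate garbage collection can be sketched as follows. This
is a hypothetical model of the bookkeeping only: the names readcount and
copycount come from Figure 7, while the class, its methods, and the decision
to check for collection on every decrement are assumptions of this sketch.

```python
class DirectoryManagerState:
    """Counters that decide when deleted buckets may be deallocated."""

    def __init__(self, peers):
        self.peers = peers        # number of other directory managers
        self.readcount = 0        # requests in progress at this manager
        self.copycount = 0        # copyupdates not yet acknowledged
        self.garbage = []         # deleted buckets awaiting deallocation
        self.collected = []       # buckets handed to garbage collection

    def request_started(self):
        self.readcount += 1

    def request_finished(self):
        self.readcount -= 1
        self._maybe_collect()

    def update_broadcast(self, deleted_bucket=None):
        # One acknowledgement is expected from every other directory manager.
        self.copycount += self.peers
        if deleted_bucket is not None:
            self.garbage.append(deleted_bucket)

    def ack_received(self):
        self.copycount -= 1
        self._maybe_collect()

    def _maybe_collect(self):
        # Collect only when no request and no unacknowledged update remains.
        if self.readcount == 0 and self.copycount == 0 and self.garbage:
            self.collected.extend(self.garbage)
            self.garbage.clear()
```

With two peers, a deleted bucket stays allocated until both acknowledgements
have arrived and no request is still in flight.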

Because obsolete directory information is usable, the multiple copy update does
not have to be strictly synchronized (in the sense of an atomic transaction). However,
the ordering of different directory modifications due to operations on the same
bucket should be the same across all copies and determined by the order in which
the bucket operations are performed. Each split or merge changes the version
numbers of the affected buckets. A split generates two buckets with version numbers
one greater than that of the original bucket. A merge results in one bucket with a
version number one larger than the maximum version of the two partners. The
version number in each directory entry should match the version of the bucket it
points to when the directory is completely up to date. Each directory manager applies
the modifications indicated by an update or copyupdate message to its local copy
when the version numbers of the affected directory entries match the version
numbers in the message, which reflect the versions of the buckets involved. This use
of version numbers for scheduling updates enforces the desired ordering. The
following example illustrates why this ordering approach is adopted. Suppose first a
split operation is performed, almost immediately followed by a merge involving those
two buckets. Imagine a directory manager that hears about these updates in the
opposite order and applies them. The directory update related to the merge would
essentially have no effect since the split had not yet been processed. The subsequent
update related to the split would result in directory entries leading to a deleted
bucket. At this point the directory is usable since next links provide recovery.
However, since it appears that both messages have been serviced, the deleted bucket
could then be deallocated. This would leave that copy of the directory in a truly
incorrect state from which recovery would be impossible.
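
The scheduling rule can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions and not the code of the managers: directory entries are named by hypothetical string keys instead of being computed from the pseudokey and local depth, and the message fields `expect` and `install` are invented for the illustration.

```python
class DirectoryCopy:
    """One replica of the directory: entry name -> (bucket, version)."""

    def __init__(self, entries):
        self.entries = dict(entries)
        self.delayed = []  # updates whose version numbers do not match yet

    def _ready(self, msg):
        # Applicable only if every affected entry still holds the bucket
        # and version that existed just before the bucket operation.
        return all(self.entries.get(e) == bv for e, bv in msg["expect"].items())

    def receive(self, msg):
        """Apply msg if its versions match; otherwise keep it delayed."""
        self.delayed.append(msg)
        progress = True
        while progress:  # applying one update may enable delayed ones
            progress = False
            for m in list(self.delayed):
                if self._ready(m):
                    self.entries.update(m["install"])
                    self.delayed.remove(m)
                    progress = True


# Hypothetical two-entry directory; bucket a (version 1) fills both entries.
split_msg = {  # a splits into a and d, both at version 2
    "expect": {"0": ("a", 1), "1": ("a", 1)},
    "install": {"0": ("a", 2), "1": ("d", 2)},
}
merge_msg = {  # d merges back into a, giving version 3
    "expect": {"0": ("a", 2), "1": ("d", 2)},
    "install": {"0": ("a", 3), "1": ("a", 3)},
}

replica = DirectoryCopy({"0": ("a", 1), "1": ("a", 1)})
replica.receive(merge_msg)  # arrives first: versions do not match, delayed
replica.receive(split_msg)  # applied, which then releases the merge
```

In this sketch the replica that hears about the merge first ends in the same state as one that hears the messages in order, because the merge is held back until the split has advanced the version numbers.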

Under the assumptions that processes do not fail, message buffering is sufficient
to eliminate blocking on a send, and messages are reliably delivered, this
solution can be shown to be deadlock free and correct in the sense that requests are
serializable in their externally observable behavior. Although extremely unlikely, the
theoretical possibility of indefinite postponement does exist.

In discussing the correctness of this algorithm, we wish to separate the arguments
concerning the replication of the directory from those about the basic protocols and
processing that service the user's request. This allows us to view the replicated
directories as a single global directory with certain desirable properties in later phases
of the discussion. Intuitively, we need a statement to the effect that the information
gathered from a directory access may not accurately reflect the current state of the
hash file; but it is incorrect in such a way that next links provide adequate recovery.
We now attempt to formalize this idea somewhat and then show that our multiple
copy update strategy actually maintains this property. Throughout this presentation,
the term transaction is used for the execution of a single find, insert, or delete
operation as it moves through various managers.

The version of the directory seen by a transaction can be expressed as one
member of a set of schedules, S, that defines the state of the directory. For this to
make sense, we must elaborate on the notion of a schedule. Consider the set, A, of all
split, merge and remove (enabling garbage collection) actions resulting from update
operations that have changed the bucket structure by the time of the directory access
in question. For example, a delete request may require no directory modifications at
all or it may generate a merge and subsequent remove that become members of A.
There is a partial ordering imposed on these actions based on when bucket
modifications are made. Specifically, if two operations affect the same bucket, then
there is a relationship established between them. A schedule is a totally ordered
subset of A that obeys the following constraint: The order of actions within the
schedule must be consistent with the partial order. No individual schedule in the set
S necessarily represents the timing of bucket modifications; but rather, it can be
viewed as encoding a valid directory structure at some point during a possible
execution sequence of the actions in A. An action is considered done and its effects
incorporated into the directory when it appears in each schedule of S. All other
actions are still in progress. In the case of a delete request that causes two buckets to
be merged, the deleted bucket is not deallocated until the associated remove action is
done, so recovery through its next link is still possible. The point is that the
appropriate next links are set up before the related split or merge action appears in
any schedule of S and deleted buckets remain in place until all schedules include the
relevant remove action. Consequently, any member of S represents usable
information.

In the implementation of the replicated directories, each copy corresponds to
one schedule in the set. The sequence of actions in a schedule indicates the order of
directory updates applied to that copy. A split action signifies the local execution of
the updatedirectory procedure and possibly doubledirectory. A merge action
represents the execution of halvedirectory or updatedirectory. Remove denotes the
equivalent of placing an exclusive-lock on the local copy (i.e. testing the readcount).
Inclusion in the set A can be defined by the set of update messages that have been
sent from bucket managers to directory managers. A sequence of these actions is an
appropriate model for the state of a single copy since the corresponding code sections
are performed serially by the manager. There are various ways of enforcing this
requirement. In the multiplexed directory manager given, access to its copy of the
directory by concurrent transactions is controlled by explicit scheduling: the receipt
of a message establishes a context for the resulting processing and the directory
structure is put into a consistent state before the context changes again. Either the
required values are contained within the incoming message to initialize the context
(e.g. copyupdate or request messages) or saved values that were previously tagged
with a transaction number are restored when further steps must be taken on behalf
of the transaction (e.g. due to arrival of an update message). The directory updates
are scheduled locally in response to receipt of an update or copyupdate message. Our
requirements state that this scheduling must be consistent with the partial ordering
on actions. This is accomplished using the version numbers. Each split or merge
changes the version numbers of the affected buckets. A split generates two buckets
with version numbers one greater than that of the original bucket. A merge results in
one bucket with a version number one larger than the maximum version of the two
partners. The partial ordering is determined from the buckets and resulting version
number associated with each action. For example, consider the following set of
actions applied to the hash file in Figure 3 where the format for an individual action
is <type of action and transaction number, first bucket involved, second bucket,
resulting version number>:

{<split 1, bucket a, bucket d (new), version 2>,
 <split 2, bucket c, bucket e (new), version 3>,
 <split 3, bucket e, bucket f (new), version 4>,
 <merge 4, bucket d, bucket a, version 3>,
 <merge 5, bucket b, bucket a, version 4>}.

Then, using < for the precedence relation,

split 1 < merge 4 < merge 5 and split 2 < split 3.
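
The precedence relation of this example can be recomputed mechanically. The following Python sketch assumes, as a simplification of ours, that two actions are related exactly when they name a common bucket and that the one with the smaller resulting version number comes first; the function names are hypothetical. The schedule check covers only the ordering constraint stated above.

```python
from itertools import combinations

# (label, first bucket, second bucket, resulting version), as listed above.
ACTIONS = [
    ("split 1", "a", "d", 2),
    ("split 2", "c", "e", 3),
    ("split 3", "e", "f", 4),
    ("merge 4", "d", "a", 3),
    ("merge 5", "b", "a", 4),
]


def precedence(actions):
    """Return pairs (x, y) meaning x < y: the two actions touch a common
    bucket and x produced the smaller version number."""
    pairs = set()
    for x, y in combinations(actions, 2):
        if {x[1], x[2]} & {y[1], y[2]}:
            first, second = sorted((x, y), key=lambda act: act[3])
            pairs.add((first[0], second[0]))
    return pairs


def is_valid_schedule(schedule, actions):
    """A schedule (list of labels) must order its members consistently
    with every precedence pair whose two actions it contains."""
    pos = {label: i for i, label in enumerate(schedule)}
    return all(pos[x] < pos[y]
               for x, y in precedence(actions)
               if x in pos and y in pos)


PAIRS = precedence(ACTIONS)
```

Besides the pairs written out in the text, the computed relation also contains the transitive pair split 1 < merge 5, since both actions involve bucket a.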

Each directory manager schedules updates on its copy based on its record of
which actions have already been incorporated into the structure. This information is
encoded as version numbers in each entry of the table to be matched against the
version number of updates (data supplied in the update or copyupdate message).
Specifically, the Boolean function, VersionsDoNotMatch, must calculate the indices
of the affected directory entries (using the pseudokey, whether the operation was an
insertion or a deletion, and the local depth of the buckets prior to modification) and
then compare version numbers of the entries and the message.

The requirement that deleted buckets remain available until all schedules
contain the associated remove action is enforced by a conservative approach. The
directory manager initially contacted for a request to delete that causes two buckets
to merge is responsible for determining when the space can be reclaimed. It must
collect acknowledgements related to the merge from all other directory managers and
wait until transactions using old information from its own copy have finished before
the partner's page can be deallocated. In fact, the directory manager waits for all
outstanding acknowledgements and a quiescent local state before triggering garbage
collection. Other directory managers wait until there are no transactions using their
copies before sending acknowledgements for deletions.
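
A minimal Python sketch of this conservative rule follows. It is an illustration under our own assumptions rather than the managers' code: the class and method names are hypothetical, messages are modeled as return values, and only the bookkeeping of readcount, held acknowledgements, and outstanding acknowledgements is shown.

```python
class DirectoryManager:
    """Bookkeeping for delayed acknowledgements and garbage collection."""

    def __init__(self, name):
        self.name = name
        self.readcount = 0   # transactions still using the local copy
        self.held_acks = []  # (bucket, origin) acks saved for deletions
        self.awaiting = {}   # bucket -> managers that have not yet acked

    def begin_transaction(self):
        self.readcount += 1

    def end_transaction(self):
        self.readcount -= 1
        return self._flush()

    def responsible_for(self, bucket, others):
        """This manager handled the delete that merged `bucket` away."""
        self.awaiting[bucket] = set(others)

    def on_copyupdate_delete(self, bucket, origin):
        self.held_acks.append((bucket, origin))
        return self._flush()

    def on_ack(self, bucket, sender):
        self.awaiting[bucket].discard(sender)
        return self._flush()

    def _flush(self):
        """Return (acks to send now, buckets that may be deallocated)."""
        if self.readcount > 0:  # local copy is not quiescent yet
            return [], []
        acks, self.held_acks = self.held_acks, []
        freed = [b for b, waiting in self.awaiting.items() if not waiting]
        for b in freed:
            del self.awaiting[b]
        return acks, freed
```

A manager with a transaction in progress withholds its acknowledgement, and the responsible manager reports a bucket as reclaimable only after every other manager has acknowledged and its own readcount is zero.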

The next step is to assume a well-behaved global directory and show that
concurrent transactions do not interfere with each other or destroy the data structure.

First, we need to demonstrate that the search phase of a transaction arrives at
the right bucket. The user's request for an operation may be directed to any available
directory manager. In servicing this request the manager generates a transaction
number, decides which bucket manager to contact and saves some state about the
transaction. The information used to determine the appropriate bucket manager may
be out of date because of insert or delete operations that are still in progress (i.e. the
associated update or copyupdate message has not yet been processed).

Imagine a searching transaction that indexes into the directory and finds a
pointer to bucket A as that directory entry is about to be changed to reflect a split or
merge. If A has recently been split, A's next link will lead to the new bucket which
contains the records moved from A. If A has just been merged into its partner, it will
be marked as deleted, making it the "wrong bucket" for any search and the next link
again will provide recovery. The important observation is that obsolete directory
entries that are still visible always point to a bucket from which the correct bucket is
reachable via next links. The changes in the bucket structure appear as atomic actions
to concurrent transactions. In our formulation of the bucket manager, a slave process
is spawned for each transaction within each manager involved in the transaction.
Thus there is the need for locking to control concurrent access to a manager's
buckets. Adding or removing a key without causing restructuring is done in a single
disk put operation. If the target bucket for an insertion is full, it will be replaced by a
pair of buckets in which the old contents are distributed between the two according
to pseudokey. The new record will be included in the appropriate partner if there is
room. The second half of the pair is written first in a newly allocated disk page and
then the old bucket is replaced by the first half of the pair. Immediately after the
first put, the new bucket is still not reachable through pointers in the hash file. Thus
writing the pair is equivalent to the single operation of writing the first partner. Two
buckets that are being merged are protected with exclusive-locks so intermediate
states are not visible. Upon arriving at the right bucket, a process performing an
insert or delete must also see the right version of it. Again, a lock which excludes
other updaters is required in order to read the bucket contents into private storage
and is held until the bucket is rewritten (or it is discovered that no change is needed).
Thus previous updaters have made their modifications known by the time a new
updater gains its lock. Processes executing the find operation may legitimately see
either an old or the new version of the target bucket.
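
Recovery through next links can be sketched in a few lines of Python. The sketch rests on assumptions that this section does not confirm: a bucket is taken to be the right one when the low-order localdepth bits of the pseudokey equal its commonbits, and locking is omitted entirely. The names `owns` and `find_bucket` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    name: str
    localdepth: int
    commonbits: int
    next: Optional["Bucket"] = None
    deleted: bool = False


def owns(bucket, pseudokey):
    """Assumed test for the right bucket: not deleted, and the low
    localdepth bits of the pseudokey equal the bucket's commonbits."""
    mask = (1 << bucket.localdepth) - 1
    return not bucket.deleted and (pseudokey & mask) == bucket.commonbits


def find_bucket(start, pseudokey):
    """Follow next links from a possibly obsolete starting bucket."""
    bucket = start
    while bucket is not None and not owns(bucket, pseudokey):
        bucket = bucket.next
    return bucket


# Case 1: A was just split; a stale directory entry still points to A.
new_half = Bucket("B", localdepth=2, commonbits=0b10)
a_split = Bucket("A", localdepth=2, commonbits=0b00, next=new_half)
after_split = find_bucket(a_split, 0b110)

# Case 2: B was just merged into A; a stale entry still points to B.
a_merged = Bucket("A", localdepth=1, commonbits=0b0)
b_deleted = Bucket("B", localdepth=2, commonbits=0b10,
                   next=a_merged, deleted=True)
after_merge = find_bucket(b_deleted, 0b110)
```

In both cases the search starts at the bucket named by the obsolete entry and reaches the bucket that currently holds the key's range.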

Next we consider potential interference among update transactions. Once an
update arrives at the right bucket and gains the locks it requires, the actual
modifications are essentially serialized. Thus updaters work with the most recent
version of that bucket. However, for a deleter to get to the point where it has all the
locks it needs can be somewhat involved if the target bucket is the "1" partner of a
potential merge. The deleter must release its lock on the target bucket, place a lock
on the "0" partner, and then re-lock the "1" partner. While this is taking place, other
update operations may be affecting these buckets. In particular, a concurrent
insertion could add new records to the target bucket once the deleter's lock is
released so that it is no longer empty enough to allow merging. It is even
theoretically possible for a stream of inserters to fill up the target bucket and cause a
split, thereby moving the key that is to be deleted. In addition, another deleter might
get the two partners locked and merged before the deleter we are focusing on does.
Each of these conditions is checked for and the pitfalls avoided. After gaining the
lock on the "0" partner, the deleter checks whether merging might be possible (the
partner's next link points to the target bucket), and if this check fails, it goes back to
simply trying to remove its key. If the two buckets are not linked in this way, it may
mean the localdepths do not match or that the target bucket has been deleted.
Attempting to lock the target bucket under these circumstances would carry with it
the danger of deadlock. Upon finding the two buckets directly linked and re-locking
the "1" partner, the deleter checks the emptiness of the bucket, whether the desired
key is still there, and whether localdepths still match before going ahead with the
merge. Unless the key has moved, the deleter at this point would have the needed
locks and no further interference could occur at the bucket level.
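
The sequence of releasing, re-locking, and re-checking can be sketched as follows in Python. This is a single-site illustration under our own assumptions, not the bucket manager code: plain mutexes stand in for the lock modes, the merge criterion is taken to be that both partners' keys fit within a hypothetical `capacity`, and the returned strings are labels invented for the sketch.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Bucket:
    localdepth: int
    keys: set
    next: Optional["Bucket"] = None
    deleted: bool = False
    lock: threading.Lock = field(default_factory=threading.Lock)


def delete_from_one_partner(zero, one, key, capacity):
    """Caller holds one.lock on the "1" partner (the target bucket)."""
    one.lock.release()       # give up the target bucket first ...
    zero.lock.acquire()      # ... so locks are taken in next-link order
    if zero.next is not one:
        # Not directly linked: locking the target now risks deadlock.
        zero.lock.release()
        return "retry plain delete"
    one.lock.acquire()       # re-lock the target, then re-check everything
    try:
        if key not in one.keys:
            return "key moved"
        one.keys.remove(key)
        fits = len(zero.keys) + len(one.keys) <= capacity
        if fits and zero.localdepth == one.localdepth:
            zero.keys |= one.keys
            zero.localdepth -= 1
            # The deleted bucket's next link leads back to its partner.
            zero.next, one.next = one.next, zero
            one.deleted = True
            return "merged"
        return "removed"
    finally:
        one.lock.release()
        zero.lock.release()
```

Each early return corresponds to one of the pitfalls listed above: partners that are no longer linked, a key that has moved, and a target bucket that is no longer empty enough or whose localdepth no longer matches.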

Bucket manipulations that are completely contained within one bucket manager
work almost exactly like the centralized solution (Ellis 83). Processing may go outside
the boundaries of one bucket manager if the search phase has arrived at the wrong
bucket manager, a split is required and no space is available locally, or a merge
appears necessary and the partner is remote. In each of these situations, a second
bucket manager becomes involved. In this presentation of the algorithm, an off-site
split is handled directly by the front end process since it does not affect existing
buckets in the second manager's partition. For the other cases, another slave is
created for the transaction by the second manager. A wrongbucket message transfers
the necessary state for continuation of processing at the new site. Calls to SendAck
and SendBucketdone generate messages that trigger the releasing of read-locks. If a
split is called for, two or three processes (i.e. the originating directory manager, the
bucket manager slave currently responsible for the full bucket, and possibly a bucket
manager front end with available space) become involved; however, there is no real
parallelism among them so the order in which the disk operations take place is
well-defined.

The merge is slightly more complex. There are two cases to consider based on
which of the partners the original bucket manager has. The Mergedown message and
its associated reply are used when the first manager has the "0" partner of the
potential merge to share state values needed by the other manager (e.g. the
localdepths of the two buckets must be compared and new links must be set up in
both buckets). The Mergeup protocol (i.e. Mergeup, MergeUpReply, and GoAhead
messages) serves to exchange the information needed for the extra checking on
mergeability described above. Parallelism is allowed between the two bucket
managers; however, because of the exclusive-locks protecting the two partners, the
ordering of disk operations does not matter.

The freedom from deadlock argument depends on the fact that locks are
requested according to an ordering on the buckets. While a bucket is locked,
additional locks are requested only on buckets reachable from it via next links. Given
the way deleted buckets are handled, it is not true that the ordering between two
buckets stays the same for as long as both exist. Thus, initially bucket B may be
reachable from bucket A, but if they are partners this relationship may be reversed as
B is merged into A. However, it is not possible for transactions following the old
ordering to coexist with ones following the new ordering because during deletion
exclusive-locks are used to ensure that all the slave processes with old information
have cleared out of the vicinity of the merge. Extra precautions must be taken by the
slaves involved in a deletion to check that the locking of partners is consistent with
reachability.

This  distributed  implementation  not  only  has  locking  as  a  potential  source  of 
deadlock  but  also  involves  message  flows  and  internal  scheduling  of  requests  within 
managers. It is necessary to demonstrate that these factors do not introduce deadlock.
A  transaction  could  be  blocked  if  it  requires  service  from  a  process  that  is  blocked  on 
a  receive  message  primitive  or  it  is  stuck  in  one  of  a  directory  manager’s  scheduling 
tables. 

First, consider the message flows. Ignoring name lookup for ports, there is a
single receive point in the directory manager code (in the procedure GetMessage at
the top of the outer loop) and it accepts any incoming message regardless of message
type or identity of sender. Basically the same statement holds for the bucket manager
front end processes. Each instance of a bucket slave is dedicated to one transaction.
This fact simplifies the analysis of protocols between bucket managers. For each
receive point in the bucket slave code, we can characterize the state of both the
sender and receiver. For example, the receivemessage in the find case is executed
only when onmachine = false and SendWrongbucket has been done. This implies
that messageid = Wrongbucket in the other slave process and SendAck is eventually
executed. It is easy to see that the message flows through bucket managers do not
cause deadlock by doing this kind of analysis for each receive point.

There are four ways in which a transaction can get delayed within directory
managers: it may be in the context table awaiting a bucketdone or update message
from a bucket manager, its directory updates may be delayed until versions match,
copyupdate acknowledgements for deletions may be waiting for the equivalent of
local exclusive-locking (i.e. a readcount of zero), and the initiation of garbage
collection may be waiting for the analogue of global exclusive-locking (i.e. local
exclusive-locking plus receipt of outstanding copyupdate acknowledgements). The
first case presents no problem as long as a bucketdone or update message is sent
back to the originating directory manager for each find, insert, or delete message.
This is true as can be seen by following each branch of the bucket slave code for
handling the find, insert, and delete message types. The second case requires a
guarantee that versions eventually do match. The update message contains the old
version numbers and the oldlocaldepth of the two buckets involved. The
oldlocaldepth and the pseudokey are used to determine which directory entries must
have the matching version numbers. The basis of the argument that the desired
pattern of version numbers eventually occurs is the partial ordering on transactions
previously described and the way this partial ordering is implemented using the
version numbers. The ordering on transactions affecting shared buckets precludes
two transactions each waiting for the other to advance a version number in the
directory. The third and fourth sources of delay are related. The key observation is
that readcount represents the number of transactions initially handled locally that
have not yet applied their modifications to the local copy of the directory.
Copyupdates are not reflected in the value of readcount. It is possible for readcount
to reach zero if new requests do not continually arrive since delayed updates are not
permanently blocked. Copycount becoming zero at some directory manager depends
on each directory manager independently reaching the point where it is finished with
all but the garbage collection work of the transactions it is responsible for. Thus,
sending remembered acknowledgements and garbage collection can be indefinitely
postponed by a steady stream of new requests but deadlock among a fixed set of
transactions is not possible.

3.  Incorporating  Fault  Tolerance 

The solution just described does not address the issues of crash tolerance and
recovery. The structure of that solution reveals that it is a fairly straightforward
adaptation of the earlier concurrent algorithm. Consideration of crash recovery
suggests a slightly different organization.

The problems associated with processor and communication failures could be
conveniently avoided if it were possible to embed updates to the hash file in a system
based on atomic transactions. However, the goals guiding the design of our solution
(e.g. concurrency and availability) have led to locking protocols that are not
compatible with standard commit protocols. As we shall see, the atomic transaction
construct is a useful tool when applied to small groups of steps within the processing
of an individual update operation.

The kinds of failures being addressed include the failure of a manager with loss
of all associated volatile process state but not of its portion of the hash file residing
on disk. Lost messages and network partitions are also considered. We assume that it
is possible to detect the death of a process. The IPC mechanism used here as the
model of communication provides notification to potential senders when a port
disappears (for example as a consequence of the receiver's death) so this assumption
is reasonable.

The most significant problems with the current distributed solution have to do
with interactions among the directory servers. In particular, directory updates are
funneled through the one directory server initially contacted for the operation and it
forwards copyupdate messages to all the other directory managers. A directory
manager cannot allow the garbage collection of the set of deleted buckets for which
it is responsible until it has collected acknowledgements from all other directory
managers. Furthermore, if a server goes down before propagating the directory
update information, the scheduling of other updates at other managers is affected.

In order to prevent a failed directory manager from holding up completion of an
operation, we need the ability to remove unavailable servers from participation in the
normal directory update routine and re-enlist them later. Thus, an individual bucket
may be deallocated when all directory servers either acknowledge the associated
update message or are designated as being down. This approach requires additional
information in acknowledgement messages (i.e. identification of transaction and
sender) and more state kept by the server responsible for outstanding (not yet fully
acknowledged) directory updates. A directory manager that is rejoining the system
must construct a sufficiently up-to-date copy of the directory before resuming normal
processing. The key observation is that the buckets contain the necessary information
for building such a copy (i.e. localdepth, commonbits, and the next link). The
starting point for a scan of buckets, namely the first bucket, is in a fixed location and
never moves during restructuring; so any old version of the directory can be used to
find it. The manager follows next links through all the buckets using a lock-coupling
protocol with selective-locks. This approach leaves the recovering manager
vulnerable to failures of bucket managers. If this possibility is determined to be
unacceptable, the manager can start with a reasonably good copy of the directory
(acquired from a healthy directory server) and use the bucket scan to verify and
update its entries. In this case, upon encountering an unavailable bucket manager,
the server can take a fairly low-risk chance of missing some information and skip
over those buckets.
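
A directory copy can be rebuilt from such a scan roughly as in the Python sketch below. It is illustrative only and depends on assumptions not confirmed by this section: the directory is taken to be indexed by the low-order bits of the pseudokey, the global depth is taken to be the largest localdepth found, and lock-coupling, version numbers, and unavailable bucket managers are left out. The function name is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    name: str
    localdepth: int
    commonbits: int
    next: Optional["Bucket"] = None


def rebuild_directory(first_bucket):
    """Scan all buckets via next links and derive one directory copy."""
    buckets = []
    b = first_bucket
    while b is not None:
        buckets.append(b)
        b = b.next

    depth = max(b.localdepth for b in buckets)
    directory = [None] * (1 << depth)
    for b in buckets:
        mask = (1 << b.localdepth) - 1
        for index in range(len(directory)):
            # Every entry whose low localdepth bits equal commonbits
            # refers to this bucket.
            if (index & mask) == b.commonbits:
                directory[index] = b.name
    return directory


# Hypothetical file with three buckets chained from the first bucket.
b_bucket = Bucket("b", localdepth=1, commonbits=0b1)
d_bucket = Bucket("d", localdepth=2, commonbits=0b10, next=b_bucket)
first = Bucket("a", localdepth=2, commonbits=0b00, next=d_bucket)
rebuilt = rebuild_directory(first)
```

Here bucket b, with localdepth 1, fills both directory entries whose low bit is 1, while a and d each fill one entry.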

The  second  major  aspect  of  this  more  fault  tolerant  solution  is  a  reassignment  of 
responsibilities.  Rather  than  having  the  propagation  of  directory  updates  handled  by 
one  of  the  directory  managers,  the  bucket  manager  in  charge  of  the  bucket  update 
broadcasts  the  directory  update  messages  and  collects  the  acknowledgements.  Since 
the  directory  manager  initially  contacted  does  not  assume  responsibility  for  an 
operation  once  the  appropriate  bucket  manager  has  the  target  bucket  locked,  the 
bucket  manager  can  immediately  send  a  bucketdone  message  to  allow  the  directory 
manager  to  forget  any  state  it  had  saved  about  the  transaction  and  later  send  it  an 
update  message  if  necessary.  The  bucket  managers  must  maintain  a  list  of  directory 
managers  believed  to  be  up.  The  bucket  scan  performed  by  a  recovering  directory 
manager  serves  to  announce  its  existence  to  the  bucket  managers.  In  addition,  bucket 
managers  can  periodically  exchange  their  up-lists.  Recovering  bucket  managers  must 
acquire  a  current  up-list  and  send  information  to  directory  managers  for  verification 
of  their  entries  for  its  buckets.  The  removal  of  a  network  partitioning  is  detected 
during  the  exchange  of  up-lists  and  dealt  with  by  the  bucket  manager  recovery 
mechanism. 
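
The bookkeeping a bucket manager needs for this role can be sketched in Python as follows. The class and its method names are hypothetical, and the sketch assumes that designating a manager as down releases every update still waiting on it, in line with the deallocation rule stated earlier; timeouts and retransmission are not modeled.

```python
class PendingUpdates:
    """Tracks acknowledgements for directory updates that one bucket
    manager has broadcast."""

    def __init__(self, directory_managers):
        self.up = set(directory_managers)  # the up-list
        self.waiting = {}                  # transaction -> managers not yet acked

    def broadcast(self, transaction):
        # Only managers believed to be up are expected to acknowledge.
        self.waiting[transaction] = set(self.up)

    def ack(self, transaction, sender):
        # Acks identify transaction and sender, so duplicates are harmless.
        self.waiting.get(transaction, set()).discard(sender)

    def mark_down(self, manager):
        self.up.discard(manager)
        for managers in self.waiting.values():
            managers.discard(manager)

    def mark_up(self, manager):
        # A rejoining manager rebuilds its copy by scanning the buckets,
        # so it owes no acknowledgements for earlier broadcasts.
        self.up.add(manager)

    def complete(self, transaction):
        """True once every manager has acknowledged or been marked down."""
        return not self.waiting[transaction]
```

With three directory managers, an update is complete after two acknowledgements if the third manager has been designated as down in the meantime.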

The advantage of this reorganization is that the failure of a bucket manager
prevents subsequent updates concerning those buckets from occurring so the fact
that the directory updates may not get sent is not as much of a problem as it is when
one of the directory managers is supposed to send those update messages and it is
down. In that case, there is more potential for subsequent bucket operations whose
associated directory updates will be held up by the missing messages.

The remaining details needed for fault tolerance are applications of standard
techniques such as timeout and retransmission of messages. The act of writing the
two buckets involved in a merge operation back to disk should be done atomically
using a commit protocol. The order of writing the two buckets involved in a split
operation makes them visible in one atomic step but failures during this action may
result in the allocation of a new bucket that never gets incorporated into the data
structure. It may be convenient to enclose the disk operations involved with splitting
within an atomic action as well.

Figure 9 shows the revised message flow for the increased degree of fault
tolerance provided. Figures 10 and 11 give the pseudocode for the managers
implementing this solution.


Figure 9 Fault Tolerant Messages (ignoring commit protocols)

(Message-flow diagram among users, directory managers, and bucket managers;
graphic not reproduced. Surviving labels: Request, Response, Recover, RecoverAck,
Find, Insert, Delete, Bucketdone, Update, Ack, Go ahead, Bucket data, Bucket
recovery, Up-list exchange, Mergedown, Split bucket, Splitreply, SplitCancel,
Wrongbucket, Wrongbucket ack.)


Figure 10 Pseudocode for Fault Tolerant Directory Managers

while (true) {
    messageid = GetMessage(&msg);
        /* Either receives a message or takes a message off
           the list of delayed but now ready directory updates.
           Messageid = "timeout" if list is empty and receive primitive
           times out. */
    switch (messageid) {
    case request:   /* from user */
        Calculate pseudokey and locate current
            incarnation of the bucket manager responsible
            for desired bucket;
            /* Maps bucket manager id to port either through
               a local cache or the IPC name server */
        if (valid port found) {
            readcount = readcount + 1;
                /* number of transactions
                   currently using directory data */
            Generate transaction # and save state related
                to this request;
            Set timer for transaction #;
            Construct and send a "find", "insert" or
                "delete" message to bucket manager;
        }
        else send user a failure reply;
        break;

    case bucketdone:   /* from bucket manager */
        if (transaction # not in use)
            /* This is a duplicate message or state lost in crash */
            break;
        readcount = readcount - 1;
        CleanState(msg.transaction #);
            /* forget about this request */
        break;

    case update:   /* from bucket manager */
        if (VersionsDoNotMatch(msg))
            /* compares version numbers in message with
               version numbers in corresponding directory entries;
               detects duplicate update messages that have already
               been done and reissues acks in appropriate way */
            DelayUpdate(msg);
                /* Eliminates duplicates of messages in queue */
        else {
            if (msg.op == insert) {
                Apply appropriate updates
                    to local copy of directory;
                SendAck(msg.ackport, transaction #, myport);
                    /* respond to bucket manager
                       who initiated this update */
            }
            else {   /* op = delete */
                Apply local directory updates;
                RememberAck(msg.ackport, transaction #, myport);
                    /* save up acks until the equivalent of
                       exclusive-locking occurs */
            }
            ReleaseQueued();
                /* If finishing this directory update enables previously
                   delayed ones, make them accessible to GetMessage */


    case reinit:   /* from OS's process manager that
                      restarted directory manager process.
                      Note this uses a conservative approach -
                      doesn't skip bucket managers */
        do {
            Contact IPC name server to locate a healthy
            directory manager and
            send a "request copy" message to it;
            Receive response containing copy of directory;
        } while (timeout or emergency message received);
        NoGoodMessage = true;
        while (NoGoodMessage) {
            lookup port to first bucket manager;
            if (no valid port found) delay();
            else {
                send "recover" message;
                while (true) {
                    Receive message;
                    if (timeout) break;
                    if (it is a bucketdata message) {
                        NoGoodMessage = false;
                        break;
                    }
                    if (emergency message about
                        this port) {
                        delay();
                        break;
                    }
                    /* ignore irrelevant emergency
                       message or duplicate directory copy */
                }
            }
        }
        /* falls through to next case */

    case bucketdata:   /* from bucket managers
                          in response to recover message */
        Update directory entry with information in message;
        if (all bucket managers have replied)
            /* all directory entries have been verified */
            Publicize own named port with name server;
            /* Implies this manager now has healthy copy
               and will now serve users' "request" messages
               and recovering directory managers' "request
               copy" messages (accesses gotten through
               name server) */
        break;

    case bucketrecovery:   /* from recovering bucket manager */
        Update directory entries, if necessary, to reflect
        true state of buckets;
        Cache port of bucket manager in local name table;
        break;

    case emergency:   /* from IPC - notification of port death */
        Remove port from cached name table;
        Place an indicator in state of each associated
        transaction that initially contacted port has died;
            /* doesn't just abort transaction since another
               bucket manager may now be involved (wrongbucket protocol)
               so waits until timer for transaction goes off */
        break;

    case timeout:
        if (!readcount) SendRememberedAcks();
            /* send acks saved by deletion updates */
        for (all transactions, t, whose timers have expired) {
            RestoreState(t);
            if (t.portdied)
                Send user a failure reply;
            else Retransmit;
        }
        break;
    }
}
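
The update case of Figure 10 delays directory updates whose version numbers do not yet match and re-acknowledges ones already applied. The sketch below illustrates that idea under a simplifying assumption that is not taken from the report: each update message names the version it expects to find in the directory entry, and applying it advances that version by one. The class and field names are hypothetical, and the distinction between immediate and remembered acks is omitted.

```python
class Directory:
    """Local directory copy that applies updates in version order (assumed scheme)."""

    def __init__(self, size):
        self.version = [0] * size      # per-entry version number
        self.target = [None] * size    # bucket each entry currently points to
        self.delayed = []              # updates that arrived too early
        self.acks = []                 # acks "sent", recorded for inspection

    def on_update(self, entry, expects, new_target, ack_to):
        msg = (entry, expects, new_target, ack_to)
        if expects < self.version[entry]:
            self.acks.append(ack_to)            # already applied: re-ack only
        elif expects > self.version[entry]:
            if msg not in self.delayed:         # drop duplicates in the queue
                self.delayed.append(msg)
        else:
            self._apply(msg)
            self._release_queued()

    def _apply(self, msg):
        entry, expects, new_target, ack_to = msg
        self.target[entry] = new_target
        self.version[entry] = expects + 1
        self.acks.append(ack_to)

    def _release_queued(self):
        # Keep applying delayed updates for as long as one of them has become ready.
        progress = True
        while progress:
            progress = False
            for msg in list(self.delayed):
                if msg[1] == self.version[msg[0]]:
                    self.delayed.remove(msg)
                    self._apply(msg)
                    progress = True


d = Directory(4)
d.on_update(2, 1, "B", "mgrB")   # arrives early and is delayed
d.on_update(2, 0, "A", "mgrA")   # applies, which then releases the delayed update
print(d.target[2], d.version[2], d.acks)
```

An update that arrives ahead of its predecessor waits in the queue and is applied as soon as the missing update has been processed, which is the behavior the comments in the figure describe for DelayUpdate and the release of queued updates.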


Figure  11  Pseudocode  for  Fault  Tolerant  Bucket  Managers 

Bucket Manager Front End Process:

/* Note that communication between bucket managers involves
   IPC name lookup - the presentation here generally assumes that
   a valid port is found */

while (true) {
    messageid = receivemessage(&msg);

    switch (messageid) {

    case splitbucket:   /* from another bucket
                           manager with no available space */
        Allocate available page;
        putbucket(newpage, msg.half2);
            /* as written, failure here makes newpage garbage */
        Send "SplitReply" message containing link to new bucket;
        break;

    case cancelsplit: Deallocate page assigned;
        break;

    case recover:
        Update up-list;
        Broadcast revival to existing slaves;
        Create slave to handle response;
        forward message to it;
        break;

    case emergency message about slave:
        Update transaction # - slave table;
        break;

    case reinit:   /* from O.S.'s process manager */
        /* Get an up-list from a neighboring bucket manager -
           a "neighbor" being another bucket manager connected
           via next or prev links from locally managed buckets -
           front end process probably should maintain
           a cache of bucket manager id's - current port if known */
        while (true) {
            locate port, p, for one of neighboring
            bucket managers;
            Send "empty" up-list to initiate up-list exchange;
            messageid = receivemessage(&msg);
                /* with finite timeout */
            if (messageid == up-list exchange) break;
        }
        for (all directory managers in msg.up-list)
            Send "bucketrecovery" message;
        MergeUp-lists(&up-list, msg.up-list);
        Allocate public port and assert my long-term name for it;
        break;

    case up-list exchange:
        if (not a reply) Send own up-list in response;
        MergeUp-lists(&up-list, msg.up-list);
        for (all directory managers in msg.up-list - up-list) {
            Send "bucketrecovery" message;
            Broadcast revival to existing slaves;
        }
        break;

    case timeout:
        Initiate up-list exchange with all healthy neighbors;
            /* don't worry if valid port can not be found for one */
        break;

    default:
        if (transaction # not yet seen) {
            Create a bucket slave process and forward msg to it;
            Record transaction # - slave mapping;
        }
        else {   /* duplicate message */
            lookup transaction #;
            if (associated slave still alive)
                forward message;
            else Send appropriate reply;
        }
    }
}

Bucket Slave Process:


messageid = receivemessage(&msg);
    /* Includes data needed to initialize local copy of up-list */
if (messageid == wrongbucket) sw = msg.op;
else sw = messageid;

switch (sw) {

case recover:   /* Could try to package information about consecutive
                   buckets in one message, but doesn't in this version */
    SelectiveLock(oldpage);
    Send "recoverack" to msg.replyport;
    getbucket(oldpage, current);
    Construct and send "bucketdata" to recovering directory;
    onmachine = true;
    while (onmachine) {
        newpage = current->next;
        machine = current->nextmgr;
        if (machine == nil) break;
        if (machine != me) {
            Send "recover" to next manager;
            GetRecoverAck();   /* loops until achieved,
                                  retransmits if timeout, WAITS if destination
                                  known to be down */
            onmachine = false;
        }
        else {
            SelectiveLock(newpage);
            getbucket(newpage, current);
            UnSelectiveLock(oldpage);
            Send "bucketdata";
            oldpage = newpage;
        }
    }
    UnSelectiveLock(oldpage);
    ClearDuplicates();
    break;


case find: oldpage = msg.page;
    ReadLock(oldpage);
    if (messageid == wrongbucket)
        Send "Ack" to bucket manager holding
        previous bucket;   /* allows it to unlock */
    else Send "Bucketdone" message to directory manager;
    getbucket(oldpage, current);
    onmachine = true;
    /* Follow next links until current is the right bucket: */
    while (current is wrong bucket && onmachine) {
        newpage = current->next;
        machine = current->nextmgr;
        if (machine != me) {   /* next bucket is remote */
            Send "Wrongbucket" message to next bucket manager;
            onmachine = false;
        }
        else {   /* next bucket is local */
            ReadLock(newpage);
            getbucket(newpage, current);
            UnReadLock(oldpage);
            oldpage = newpage;
        }
    }
    if (onmachine) {
        if (search(current, msg.key))   /* is key there? */
            found(msg.key);
                /* Send user response indicating key found */
        else notfound(msg.key);
    }
    else GetWrongbucketReply();   /* See below */
    UnReadLock(oldpage);

    ClearDuplicates();   /* See below */
    break;

case insert: onmachine = true;
    oldpage = msg.page;
    SelectiveLock(oldpage);
    if (messageid == wrongbucket)
        Send "Ack" to previous bucket manager;
    else Send "bucketdone" message;
    getbucket(oldpage, current);
    follow next links until current is the right bucket
    (as in find case except use Selective locks instead of
    Read locks);
    if (!onmachine) {
        GetWrongbucketReply();
        UnSelectiveLock(oldpage);
    }
    else {
        if (search(current, msg.key)) {   /* is key already there? */
            success = true;
            UnSelectiveLock(oldpage);
        }
        else if (current->count != numentries) {
            /* current bucket not full */
            success = true;
            add(current, msg.key);
                /* inserts key into current buffer */
            putbucket(oldpage, current);
            UnSelectiveLock(oldpage);
        }
        else {   /* current is full - directory will be affected */
            success = split(current, half1, half2, msg.key);
                /* distributes the contents of the current bucket into
                   2 buffers pointed to by half1 and half2;
                   if room available, inserts key into appropriate half
                   and returns true; otherwise returns false */
            if (AvailablePages()) {
                newpage = allocbucket();
                machine = myid;
                putbucket(newpage, half2);
            }
            else {   /* no available pages locally */
                done = false;
                while (not done) {
                    Send "Splitbucket" message
                    containing contents of new bucket
                    to any manager with space;
                    while (true) {
                        messageid = receivemessage(&msg);
                        if (timeout) break;
                        if (messageid == splitbucketreply) {
                            done = true;
                            break;
                        }
                        if (emergency message about
                            potential partner) break;
                        if (messageid == wrongbucket)
                            Send "ack";
                        if (messageid == request)
                            Send "bucketdone";
                        if (emergency message about
                            death of some directory
                            manager or message from
                            front end about recovery
                            of one) update up-list;
                    }
                }
                machine = msg.bucketmgr;
                newpage = msg.page;
            }
            half1->next = newpage;
            half1->nextmgr = machine;
                /* as written, failure prior to putbucket
                   makes newpage garbage that won't get cancelled */
            putbucket(oldpage, half1);
            UnSelectiveLock(oldpage);
            BroadcastUpdates();   /* See below */
        }
    }
    if (success)
        Send user response;
    else
        Send "request" message to any directory
        manager as if it came from user;
    ClearDuplicates();
    break;

case delete:
    find the right bucket as in the beginning of insert
    except place Exclusive locks;
    if (!onmachine) {
        GetWrongbucketReply();
        UnExclusiveLock(oldpage);
    }
    else if (current bucket will not be left "too empty"
             as a result of deleting msg.key) {
        if (remove(msg.key, current))
            putbucket(oldpage, current);
        UnExclusiveLock(oldpage);
    }
    else {   /* Merging partner buckets is called for */
        if (msg.key is in first bucket of the pair) {
            newpage = current->next;
            machine = current->nextmgr;
            if (machine == me) {
                Merge on site;
            }
            else {   /* partner is remote */
                Send "Mergedown" message to
                partner's bucket manager;
                /* Mergedown Reply expected */
                while (true) {
                    messageid = receivemessage(&msg);
                    if (timeout) {
                        msg.success = false;
                        break;
                    }
                    if (messageid == MergedownReply) break;
                    if (emergency message about partner's
                        bucket manager) {
                        msg.success = false;
                        break;
                    }
                    Deal with possible duplicates;
                        /* as in GetWrongbucketReply */
                }
                if (msg.success) {   /* OK to merge
                                        (i.e. localdepths match);
                                        contents of partner in msg */
                    Construct merged bucket
                    in current buffer;
                    Start atomic action with partner;
                    putbucket(oldpage, current);
                    End atomic action;   /* Commit protocol */
                    if (aborted) success = false;
                        /* if committed, partner will be responsible for
                           propagating "update" messages */
                }
                else {   /* simply remove record */
                    if (remove(z, current))
                        putbucket(oldpage, current);
                }
            }
            UnExclusiveLock(oldpage);
        }


        else {   /* bucket is second of pair */
            newpage = current->prev;
            machine = current->prevmgr;
            UnExclusiveLock(oldpage);
            if (machine == me) {
                Merge on site;
            }
            else {   /* partner is remote */
                Send "Mergeup" message to
                partner's bucket manager;
                /* MergeUp Reply expected */
                while (true) {
                    messageid = receivemessage(&msg);
                    if (timeout) Retransmit "Mergeup" message;
                    if (messageid == MergeUpReply) break;
                    if (emergency message about
                        partner's bucket manager) {
                        msg.success = false;
                        break;
                    }
                    if (emergency message about
                        death of some directory manager or
                        message from front end about
                        recovery of one) update up-list;
                    Deal with duplicates;
                }
                if (!msg.success) {
                    /* not mergeable - simply remove record */
                    if (remove(z, current))
                        putbucket(oldpage, current);
                }
                else {   /* apparently mergeable from
                            partner's point of view - check more locally */
                    ExclusiveLock(oldpage);
                    getbucket(oldpage, current);
                    if (key to be deleted no longer
                        belongs in current bucket) {
                        UnExclusiveLock(oldpage);
                        Send "Goahead" message to partner
                        with success field set to false;
                            /* cancels merge */
                    }
                    else if (current->localdepth
                             does not match localdepth in msg ||
                             current no longer "too empty") {
                        if (remove(z, current))
                            putbucket(oldpage, current);
                        UnExclusiveLock(oldpage);
                        Send "Goahead" message
                        with success = false;
                            /* cancel merge */
                    }
                    else {
                        Send successful "Goahead"
                        message to partner;
                            /* tells partner's manager to merge */
                        current->next = current->prev;
                        current->nextmgr =
                            current->prevmgr;
                        current->commonbits = deleted;
                        Start atomic action with partner;
                        putbucket(oldpage, current);
                        End atomic action;
                        if (committed) needupdate = true;
                    }
                }
            }
        }
    }
    if (success)
        Send user response;
    else
        Send "request" message to any directory
        manager as if it came from user;
    if (needupdate) {
        BroadcastUpdates();
        ExclusiveLock(oldpage);
        deallocate(oldpage);
        UnExclusiveLock(oldpage);
    }
    ClearDuplicates();
    break;


case mergedown: newpage = msg.partner;
    ExclusiveLock(newpage);
    getbucket(newpage, brother);
    success = brother->localdepth == msg.localdepth;
    Send "MergeDown Reply" to partner;
    if (success) {
        brother->commonbits = deleted;
        brother->next = brother->prev;
        brother->nextmgr = brother->prevmgr;
        Start atomic action with partner;
        putbucket(newpage, brother);
        End atomic action;
        if (committed) {
            UnExclusiveLock(newpage);
            BroadcastUpdates();
            ExclusiveLock(newpage);
            deallocate(newpage);
        }
    }
    UnExclusiveLock(newpage);
    break;

case mergeup: newpage = msg.partner;
    ExclusiveLock(newpage);
    getbucket(newpage, brother);
    success = (brother->next == msg.target) &&
              (brother->nextmgr == msg.managerid);
    Send "MergeUp Reply";
    if (success) {
        /* "GoAhead" expected */
        while (true) {
            messageid = receivemessage(&msg);
            if (timeout) msg.success = false;
            if (messageid == GoAhead) break;
            if (emergency message about partner's manager) {
                msg.success = false;
                break;
            }
            if (messageid == MergeUp)   /* duplicate */
                Retransmit "MergeUpReply";
        }
        if (msg.success) {   /* merge */
            Construct merged bucket in brother;
            Start atomic action with partner;
            putbucket(newpage, brother);
            End atomic action;
        }
    }
    UnExclusiveLock(newpage);
    ClearDuplicates();
    break;
}


BroadcastUpdates()
{
    notdone = true;
    while (notdone) {
        Send "update" messages to all
        directory managers on up-list;
        while (true) {
            messageid = receivemessage(&msg);
            if (timeout) break;
            if (messageid == updateack)
                Mark that manager on up-list;
            if (relevant emergency message)
                update up-list;
            if (recovered directory manager) {
                update up-list;
                Send "update" message;
            }
            if (messageid == wrongbucket)   /* duplicate */
                Send "ack";
            if (messageid == request)   /* duplicate */
                Send "bucketdone";
            if (all directory managers accounted for) {
                notdone = false;
                break;
            }
            /* Ignore irrelevant emerg. msgs */
        }
    }
}

GetWrongbucketReply()
{
    while (true) {
        messageid = receivemessage(&msg);
        if (timeout) Retransmit "wrongbucket" message;
        if (messageid == WrongbucketReply) break;
        if (emergency message about next bucket manager) {
            send user a failure response;
            break;
        }
        if (messageid == wrongbucket)   /* duplicate */
            Send "ack";
        if (messageid == request)   /* duplicate */
            Send "bucketdone";
    }
}

ClearDuplicates()
{
    while (any messages pending) {
        messageid = receivemessage(&msg);
        switch (messageid) {
        case wrongbucket: Send "ack";
            break;
        case request: Send "bucketdone";
            break;
        case splitbucketreply:   /* only possible in insert */
            if (from other than chosen partner)
                Send "cancelsplit";
            break;
        /* Other possible duplicates
           (e.g. Mergeup, MergeUpReply)
           require no action */
        }
    }
}

4.  Conclusions

In this paper, we have presented a solution for distributing an extendible hash file
with replication of the directory component of the structure. The solution is
interesting in its own right for use in a distributed data base system that is expected
to frequently change size and be available from various points in the network.

The solution also serves to illustrate several points that may apply to other
problems that can be viewed primarily as data structures to be partitioned and
possibly replicated across sites of a distributed system. The first point concerns the
features of sequential data structures that make them amenable to distribution. In
this study, we chose a shallow (2-level) linked structure as a starting point. For
comparison, we can consider a deeper structure such as a B-tree or a logically
contiguous one such as linear hashing [Litwin 80]. First of all, a multilevel linked
structure offers several advantages. The links map naturally onto a port-based
communication mechanism and the indirection provided by the directory allows
flexibility in assigning buckets to sites. It is especially convenient if the top level
component is reasonably small. In our case this allows the hash function to calculate
a location in the address space belonging to a single logical processor. By contrast,
linear hashing lacks the directory component and therefore requires that a naming
convention be adopted to give the appearance of a network-wide address space
appropriate for direct calculation of bucket locations. Of course, if the directory
outgrows a single manager, extendible hashing requires a similar convention.
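
The indirection that the directory provides can be sketched as follows. This is an illustrative sketch only: the choice of SHA-1 for the pseudokey, the use of low-order bits to index the directory, and the names `pseudokey`, `Directory`, and `locate` are assumptions made here, not details taken from the solution itself.

```python
import hashlib


def pseudokey(key: str) -> int:
    """Deterministic 32-bit pseudokey (the hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")


class Directory:
    """2**depth entries, each naming the (manager, page) that holds a bucket."""

    def __init__(self, depth, entries):
        if len(entries) != 1 << depth:
            raise ValueError("directory must have 2**depth entries")
        self.depth = depth
        self.entries = entries

    def locate(self, key):
        # The hash computes a position inside this one manager's address space;
        # only the entry stored there says which site actually holds the bucket.
        index = pseudokey(key) & ((1 << self.depth) - 1)
        return self.entries[index]


directory = Directory(2, [("mgrA", 0), ("mgrA", 1), ("mgrB", 0), ("mgrB", 1)])
print(directory.locate("apple"))
```

Moving a bucket to another site changes one directory entry and leaves the hash calculation untouched, which is the flexibility the paragraph above attributes to the directory.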

The  major  complexity  of  our  solution  arises  from  the  replication  of  the  directory 
to  enhance  availability.  Although  the  absence  of  a  directory  in  the  linear  hashing 
scheme  seems  at  first  glance  to  provide  availability  easily,  there  is  a  small  set  of  data 
required  for  bucket  address  calculation  that  should  be  replicated.  In  the  naive 
solution,  this  information  should  also  be  accurate,  suggesting  a  need  for  strict 
synchronization  among  copies.  Thus  eliminating  the  directory  component  does  not 
trivialize  the  problem  as  some  researchers  have  claimed.  The  shallowness  of  our 
multilevel  structure  is  an  asset  in  that  the  short  average  search  path  makes  an  optimal 
assignment of buckets to managers relatively unimportant. For a deeper structure
such  as  a  B-tree,  one  might  want  to  address  the  hard  problem  of  grouping  pages 
within  servers  to  improve  locality. 
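
To make the remark about linear hashing concrete, the sketch below uses the address calculation as linear hashing is usually presented, in which the small set of data is a level number and a split pointer. The function name, parameter names, and the initial bucket count `n0` are assumptions for illustration.

```python
def lh_address(h: int, level: int, split: int, n0: int = 1) -> int:
    """Bucket address for hash value h under linear hashing.

    level: number of completed doubling rounds
    split: next bucket to be split in the current round
    n0:    number of buckets in the initial file
    """
    addr = h % (n0 << level)
    if addr < split:                 # that bucket has already been split this round
        addr = h % (n0 << (level + 1))
    return addr


# Accurate state: level 2, bucket 0 already split into buckets 0 and 4.
print(lh_address(4, level=2, split=1))
# A replica that has not yet heard of the split computes a different address.
print(lh_address(4, level=2, split=0))
```

A copy holding an out-of-date split pointer sends the same key to a different bucket, which is why the naive solution calls for strict synchronization among copies of this state.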

The second point demonstrated by our solution is the value of making
modifications in the implementation of the data structure that allow recovery from
the use of inconsistent information (e.g. next links) and improved locality (e.g. prev
links). There are opportunities for taking this idea even further in the solution
presented.
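
Recovery from a stale directory entry by way of next links can be sketched as below. The bucket representation is an assumption made for illustration: each bucket is taken to record a local depth and the low-order bits common to its keys, and a split is taken to link the new bucket after the old one, as the assignment to half1's next link in the insert pseudocode suggests.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Bucket:
    localdepth: int
    commonbits: int                       # low-order bits shared by keys stored here
    keys: set = field(default_factory=set)
    next: Optional["Bucket"] = None

    def owns(self, pseudokey: int) -> bool:
        return pseudokey & ((1 << self.localdepth) - 1) == self.commonbits


def find(start: Optional[Bucket], pseudokey: int):
    """Follow next links from a possibly out-of-date starting bucket."""
    current, hops = start, 0
    while current is not None and not current.owns(pseudokey):
        current, hops = current.next, hops + 1
    found = current is not None and pseudokey in current.keys
    return found, hops


# The bucket of odd keys has split; a stale directory still points at the old half.
new_half = Bucket(2, 0b11, {7})
old_half = Bucket(2, 0b01, {5}, next=new_half)
print(find(old_half, 7))
```

The search starts at the wrong bucket, notices that the key does not belong there, and reaches the right bucket after one extra hop, so a directory copy that has not yet seen the split still yields a correct answer.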

Another point has to do with methodology. Developing a distributed solution
raises a number of issues; although some are unique to this particular model of
computation, the aspect of achieving a degree of concurrency is common to both
distributed and shared data systems. Thus a correct centralized solution should prove
to be a good starting point in determining how to partition structured data. The
approach successfully used here was to first solve the problem of concurrent access
and then use that result as the basis for distributing the computation.




Finally,  it  bears  repeating  that  a  fundamental  characteristic  of  a  distributed 
system  is  the  impracticality  of  gathering  a  true  instantaneous  global  view  of  the 
world.  Successful  distributed  applications  must  be  able  to  accommodate  inconsistent 
and inaccurate information.